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ARE A SET OF MICROARRAYS INDEPENDENT 
OF EACH OTHER? 1 

By Bradley Efron 
Stanford University 

Having observed an m x n matrix X whose rows are possibly cor- 
related, we wish to test the hypothesis that the columns are indepen- 
dent of each other. Our motivation comes from microarray studies, 
where the rows of X record expression levels for m different genes, 
often highly correlated, while the columns represent n individual mi- 
croarrays, presumably obtained independently. The presumption of 
independence underlies all the familiar permutation, cross-validation 
and bootstrap methods for microarray analysis, so it is important 
to know when independence fails. We develop nonparametric and 
normal-theory testing methods. The row and column correlations of 
X interact with each other in a way that complicates test procedures, 
essentially by reducing the accuracy of the relevant estimators. 

1. Introduction. The formal statistical problem considered here can be 
stated simply: having observed an m x n data matrix X with possibly cor- 
related rows, test the hypothesis that the columns are independent of each 
other. Relationships between the row correlations and column correlations 
of X complicate the problem's solution. 

Why are we interested in column-wise independence? The motivation in 
this paper comes from microarray studies, where X is a matrix of expres- 
sion levels for m genes on n microarrays. In the "Cardio" study I will use 
for illustration there are m = 20,426 genes each measured on n = 63 arrays, 
with the microarrays corresponding to 63 subjects, 44 healthy controls and 
19 cardiovascular patients. 2 We expect the gene expressions to be corre- 
lated, inducing substantial correlations within each column [Owen (2005), 
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^The entries of X are log(red/green) ratios obtained from oligonucleotide arrays. 
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Efron (2007a), Qiu et al. (2005a)], but most of the standard analysis tech- 
niques begin with an assumption of independence across microarrays, that 
is, across the columns of X. This can be a risky assumption: all of the fa- 
miliar permutation, cross-validation and bootstrap methods for microarray 
analysis, such as the popular SAM program of Tusher, Tibshirani and Chu 
(2001), depend on column-wise independence of X; dependence can inval- 
idate the usual choice of a null hypothesis, as discussed next, leading to 
flawed assessments of significance. 

An immediate purpose of the Cardio study is to identify genes involved 
in the disease process. For gene i we compute the two-sample t-statistic "V 
comparing sick versus healthy subjects. It will be convenient for discussion 
to convert these to ^-scores, 

(1.1) z i = ^ 1 (Fei(t i )), i = l,2,...,m, 

with $ and Fqi the cumulative distribution functions (c.d.f.) of standard 
normal and t^i distributions; under the usual assumptions, Zj will have a 
standard A^(0, 1) null distribution, called here the "theoretical null." Unusu- 
ally large values of Zi or — Zi are used to identify nonnull genes, with the 
meaning of "unusual" depending heavily on column-wise independence. 

The left panel of Figure 1 shows the histogram of all 20,426 Zi values, 
which is seen to be much wider than iV(0, 1) near its center. An "em- 
pirical null" fit to the center, as in Efron (2007b), was estimated to be 
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Fig. 1. Left panel: histogram of m = 20,426 z-values (1.1) for Cardio study; center of 
histogram is much wider than N(0, 1) theoretical null. Right panel: scatterplot of microar- 
rays 31 and 32, (3^31,2:132) for i = 1,2, ... ,m, after removal of row-wise gene means; the 
scattergram seems to indicate substantial correlation between the two arrays. 
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iV(0.03, 1.57 2 ). Null over dispersion has many possible causes [Efron (2004, 
2007a, 2007b)], one of which is positive correlation across the columns of X. 
Such correlations reduce the effective degrees of freedom for the i-statistic, 
causing (1.1) to yield overdispersed null ZiS, and of course changing our 
assessment of significance for outlying values. 

The right panel of Figure 1 seems to offer a "smoking gun" for correla- 
tion: the scattergram of expression levels for microarrays 31 and 32 looks 
strikingly correlated, with sample correlation coefficient 0.805. Here X has 
been standardized by subtraction of its row means, so the effect is not due 
to so-called ecological correlations. (X is actually "doubly standardized," 
as defined in Section 2). Nevertheless, the question of whether or not cor- 
relation 0.805 is significantly positive turns out to be surprisingly close, as 
discussed in Section 4, because the row-wise correlations in X drastically 
reduce the degrees of freedom for the scatterplot. Despite the massive ap- 
pearance of 20,426 points, the scattergram's accuracy is no more than would 
be given by 17 independent bivariate normal pairs. 

Answering the title's question, that is, testing for column-wise indepen- 
dence in the presence of row-wise dependence, has both easy and difficult 
aspects. Section 2 introduces a class of simple permutation tests which, in 
the case of the Cardio data, clearly discredit column-wise independence. 
However, these tests depend on the ordering of the n columns, and can't 
be used if the initial order is lost. It is natural and desirable to look for 
test statistics of column-wise independence that are invariant under per- 
mutation of the columns. Classical multivariate analysis, as in Anderson 
(2003), develops column independence tests in terms of the eigenvalues of 
an n by n Wishart matrix. However, this theory depends on the assumption 
of row-wise independence, disqualifying it for use here. 

Sections 3 through 5 consider more general classes of independence tests, 
both from nonparametric and normal theory points of view. The theorem 
in Section 3 illustrates a key difficulty: correlation between the rows of X 
(ruled out in the classic theory) can give a misleading appearance of column- 
wise dependence. Similarly, row-wise dependence can greatly degrade the 
accuracy of the usual n x n sample covariance matrix of the columns, as 
shown by the theorem in Section 4. Various nonpermutation normal-theory 
tests are discussed in Section 5, some promising, but with difficulties seen 
for all of them. The paper ends in Section 6 with a collection of remarks and 
details. 

2. Permutation tests of column-wise independence. Simple permutation 
tests can provide strong evidence against column-wise independence, as we 
will see for the Cardio data. Our main example concerns the 44 healthy 
subjects, where X is now an m x n matrix with m = 20,426 and n = 44. For 
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convenience, we assume that X has been "demeaned" by the subtraction of 
row and column means, giving 

(2.1) Xij = ^ Xij = for i = 1, 2, . . . , m and j = 1, 2, . . . , n. 

* i 

Our numerical results go further and assume "double standardization" : that 
in addition to (2.1), 

(2.2) 22 x ij = n and }] %f,- = m for % = 1, . . . ,m and j = 1, . . . ,n, 

j i 

that is, that each row and column of X has mean and variance 1; see 
Remark 6.4 in Section 6. 

Let A be the familiar estimate of the nx n covariance matrix A between 
the columns of X , 

(2.3) A = (X'X)/m. 

Under double standardization, A is actually the sample correlation ma- 
trix, which we expect to be near the identity matrix I n under column-wise 
independence. Also let v± denote the first eigenvector of A. The left panel of 
Figure 2 plots the components of v\ versus array number 1,2,..., 44. Suppose 
that the columns of the original expression matrix, before standardization, 
are independent and identically distributed m- vectors ("i.i.d."). Then it is 
easy to see (Remark 6.2 of Section 6) that all orderings of the components 
of v\ are equally likely. This is not what Figure 2 shows: the components 
seem to increase from left to right, with a noticeable block of large values 
for arrays 27-32. 

Let S{v\) be a statistic that measures structure, for instance, a linear 
regression of v\ versus array index. Comparing S{v{) with a set of permuted 
values 

(2.4) {S* l = S(v* 1 ), Z = 1,2,...,L}, 

v* 1 a random permutation of the components of v±, provides a quick test of 
the i.i.d. null hypothesis. 

Permutation testing was applied to v± for the Cardio data, using the 
"block" statistic 

(2.5) S(v 1 ) = v' 1 Bv 1 , 
where B is the n x n matrix 

(2.6) B = J2PhPl 

h 

The sum in (2.6) is over all vectors (5h of the form 

(2.7) & = (0,0,. .. ,0,1,1,. .. ,1,0,0,. ..,0), 
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Fig. 2. Left panel: Components of first eigenvector of row sample correlation matrix for 
the 44 healthy Cardio subjects, plotted versus array number 1,2, ... ,44; dashes emphasize 
the block of large components for arrays 27-32. Right panel: First eigenvectors for healthy 
(solid line) and cancer (dashed) subjects, prostate cancer study, Singh et al. (2002); there 
was a systematic drift in expression levels as the study progressed. 



with the Is forming blocks of length between 2 and 10 inclusive. A heuristic 
rationale for block testing appears below; intuitively, microarray experiments 
are prone to block disturbances because of the way they are developed and 
read; see Callow et al. (2000). After L = 5000 permutations, only three S* 
values exceeded the actual value S(v±), p-value 0.0006, yielding strong evi- 
dence against the i.i.d. null hypothesis. 

The right panel of Figure 2 pertains to a microarray prostate cancer study 
[Singh et al. (2002)] discussed in Efron (2008): m = 6033 genes were mea- 
sured on each of n = 102 men, 50 healthy controls and 52 prostate cancer 
patients. The right panel plots first eigenvectors for A, (2.3), computed sep- 
arately for the healthy controls and the cancer patients (the two matrices 
being individually doubly standardized). Both vectors increase almost lin- 
early from left to right. Taking S(v\) as the linear regression of V\ versus 
array number, permutation testing overwhelmingly rejected the i.i.d. null 
hypothesis, as it also did using the block test. The prostate study appears 
as a favorable example of microarray technology in Efron (2008). Neverthe- 
less, Figure 2 indicates a systematic drift in the expression level readings as 
the study progressed. Some genes drift up, others down (the average drift 
equaling because of standardization) , inducing a small amount of column- 
wise correlation. 



G 



B. EFRON 



Section 5 discusses models for X where the n x n column covariance 
matrix A is of the "single degree of freedom" form 

(2.8) A = 7 + A/3/3' 

for some known fixed vector j3, the null hypothesis of column-wise indepen- 
dence being Hq : A = 0. An obvious choice of test statistic in this situation 
is 

(2.9) Sp = f3'(A-I)(3, 

a monotone increasing function of (5'Af3. If (3 is unknown, we can replace Sp 
with 

(2.10) S B = Y / P'h^f3h = tr(AY / PhA) =ti(AB), 

h=l ^ h ' 

where {/3i,/?2, ■ • ■ ,@h} is a catalog of "likely prospects" as in (2.7). 

Permutation test statistics such as (2.5) can be motivated from the sin- 
gular value decomposition (SVD) of X, 

(2.11) X = U d V , 

rnxn mxK KxK Kxn 

where K is the rank, d the diagonal matrix of ordered singular values, and 
U and V orthonormal matrices of sizes m x K and n x K, 

(2.12) u'U = V'V = I K , 

Ik the KxK identity. The squares of the diagonal elements, say, 

(2.13) e 1 >e 2 >->e K >0 {e k = d 2 k ), 

are the eigenvalues of X'X = V'd 2 V. 
Sb in (2.10) can now be written as 

(2.14) ^ = E|(^)- 

j'=i 

Model (2.8) suggests that most of the information against the null hypothesis 
Hq of independence lies in the first eigenvector vi, getting us back to test 
statistic S(vi) =v[Bvi as in (2.5). 

What should the statistician do if column-wise independence is strongly 
rejected, as in the Cardio example? Use of an empirical null rather than a 
permutation or theoretical null, AA(0.03, 1.57 2 ) rather than W(0, 1) in Figure 
1, removes the reliance on column-wise independence for hypothesis testing 
methods such as False Discovery Rates, at the expense of increased variabil- 
ity. Efron (2008) discusses these points. 
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Two objections can be raised to our permutation tests: (1) they are really 
testing i.i.d., not independence; (2) nonindependence might not manifest 
itself in the order of v\ (particularly if the order of the microarrays has been 
shuffled in some unknown way). 

Column-wise standardization makes the column distributions more sim- 
ilar, mitigating objection (1). Going further, "quantile standardization"— 
say, replacing each column's entries by normal scores [Bolstad et al. (2003)]- 
makes the marginals exactly the same. The Cardio data was reanalyzed using 
normal scores, with almost identical results. 

Objection (2) is more worrisome from the point of view of statistical 
power. The order in which the arrays were obtained should be available 
to the statistician, and should be analyzed to expose possible trends like 
those in Figure 2. 3 It would be desirable, nevertheless, to have independence 
tests that do not depend on order — that is, test statistics invariant under 
column-wise permutations. The remainder of this paper concerns both the 
possibilities and difficulties in the development of "nonpermutation" tests. 

3. Row and column correlations. There is an interesting relationship 
between the row and column correlations of the matrix X, which complicates 
the question of column-wise independence. For the notation of this section 
define the n x n matrix of sample covariances between the columns of X as 



for the m x m matrix of row-wise sample covariances (having more than 
400,000,000 entries in the Cardio example!). 

Theorem 1. If X has row and column means 0, (2.1), then the n 2 
entries o/Cov have empirical mean and variance C2, 

K 




Cov = X'X/m, 



called A in Section 2, and likewise 



(3.2) 



cav = XX'/n, 



(3.3) 




k=l 



with eu the eigenvalues (2.13), and so do the m 2 entries o/cov. 



Proof. The sum of Cov's entries is 



(3.4) 



l' n X'X\ n /m = 0, 



3 The referee points out that when Affymetrix CEL files are available, array run dates 
will usually be found in the DatHeader lines. 
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according to (2.1), while the mean of squared entries is 

E]=iEr=iC^v 2 jr _ tr{(x>xf) _ tr(yW) _ 

(o.Oj „ — 2 2 — 2 2 — C2 ' 

Replacing X'X with XX' yields the same results for the row covariances 
cov. □ 

Under double standardization (2.1)~(2.2), the covariances become sample 
correlations, say, Cor and cor for the columns and rows. Theorem 1 has a 
surprising consequence: whether or not the columns of X are independent, 
the column sample correlations will have the same mean and variance as the 
row correlations. In other words, substantial row-wise correlation can induce 
the appearance of column-wise correlation. 

Figure 3 concerns the 44 healthy subjects in the Cardio study, with X 
an (m,n) = (20,426,44) doubly standardized matrix. All 44 2 column cor- 
relations are shown by the solid histogram, while the line histogram is a 
random sample of 10,000 row correlations. Here C2 = 0.283 2 , so according to 
the theorem, both histograms have mean and standard deviation 0.283. 

The 44 diagonal elements of Cor protrude as a prominent spike at 1. (We 
can not see the spike of 20,426 diagonal elements for the row correlation 
matrix cor because they form such a small fraction of all 20,426 2 .) It is easy 
to remove the diagonal l's from consideration. 

Corollary. In the doubly standardized situation, the off-diagonal ele- 
ments of the column correlation matrix Cor have empirical mean and vari- 
ance 

(3.6) p. = and a 2 = ( C2 

Ti — l n — 1 \ Ti — l 



For n = 44 and C2 = 0.283 this gives 

(3.7) (/i,Q 2 ) = (-0.023, 0.241 2 ). 

The corresponding diagonal-removing corrections for the row correlations 
[replacing n by m in (3.6)] are negligible for m = 20,426. However, C2 over- 
estimates the variance of the row correlations for another reason: with only 
44 points available to estimate each correlation, estimation error adds a con- 
siderable component of variance to the cof histogram in the left panel, as 
discussed next. 

Suppose now that the columns of X are in fact independent, in which 
case the substantial column correlations seen in Figure 3 must actually be 
induced by row correlations, via Theorem 1. Let corn/ indicate the true 
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Column {solid hist) and row correlations Adjusted Column and row correlations 

both have mean=0 stdev=.283 both have stdev .241 



Fig. 3. Left panel: solid histogram the 44 2 column sample correlations for X the doubly 
standardized matrix of healthy Cardio subjects; line histogram is sample of 10,000 of the 
20.426 2 row correlations. Right panel: solid histogram the column correlations excluding 
diagonal Is; line histogram the row correlations corrected for sampling overdispersion. 



correlation between rows i and i! (i.e., between Xij and X^j), and define a 
the total correlation to be the root mean square of the corn' values, 

(3-8) * 2 = E™rh/(i). 

Remark 6.5 of Section 6 shows that a 2 in (3.6) is an approximately unbiased 
estimate of a 2 , assuming column- wise independence. For the Cardio exam- 
ple a = 0.241, similar to the size of the microarray correlation estimates in 
Efron (2007a), Owen (2005) and Qiu et al. (2005a). Section 4 discusses the 
crucial role of a in determining the accuracy of estimates based on X. 

The right panel of Figure 3 compares the histogram of the column correla- 
tions Corjji , now excluding cases j = j' , with the row correlation histogram 
corrected for sampling overdispersion via the shrinkage factor 0.0241/0.283. 
As predicted by Theorem 1, the similarity is striking. A possible difference 
lies in the long right tail of the Cor distribution (including Cor^i^, the 
case illustrated in Figure 1), whose significance is examined in Section 4. 

4. Normal theory. The results of Sections 2 and 3 were developed non- 
parametrically. This section concerns multivariate normal theory, afterward 
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used in Section 5 to draw the connection with classical multivariate inde- 
pendence tests. We consider the matrix normal distribution for X, 

(4.1) X ~JV m ,„(0, p ® A ), 

mxn mxm nxn 

where the Kronecker notation indicates covariance structure 

(4.2) cov(Xij,Xifj>) = fin'Ajf. 
Row Xi of X has covariance matrix proportional to A, 

(4.3) Xi~A/" n (0,^A) 

{not independently across rows unless J£ is diagonal), and likewise for column 
Xj, Xj ~ A/" m (0, Ajj'fi). As in (2.1), we take all means equal to 0. 

Much of classical multivariate analysis focuses on the situation ^ = I, 
where the rows Xi are independent replicates, 4 

(4.4) £ = /:x l i '~-A/' n (0,A), i = l,2,...,m, 

in which case the sample covariance matrix A = X'X/m has a scaled Wishart 
distribution, 

(4.5) A ~ Wishart(?n, A) /m. 
Distribution (4.5) has first and second moments 

(4.6) A ~ ( A , AS 2 \ /to) with Af k \ h = A jt A kh + A jh A kl 

nxn nxn n 2 xn 2 

for j, k, I, h = 1, 2, . . . , n; see Mardia, Kent and Bibby [(1979), page 92]. 

Relation (4.6) says that when f} = I, that is when the rows of X are inde- 
pendent, A unbiasedly estimates the row covariance matrix A with accuracy 
proportional to m _1//2 . Correlation between rows reduces the accuracy of A, 
as shown next. 

Returning to the general situation (4.1)-(4.3), define 

(4.7) A = X'a- 2 X/m, 

1/2 

where er is the diagonal matrix with diagonal entries a, = ^/ . 

Theorem 2. Under model (4-1), A has first and second moments 

(4.8) A~ (A,A( 2) /m), m = m/[l + (to — l)a ], 



4 Most multivariate texts reverse the situation, taking the columns as independent repli- 
cas of possibly correlated rows. 
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where a is the total correlation as in (3.8), 

(4.9) c?=Y$&rM*<')/ 

and A^ 2 ) is the Wishart covariance (4-6). 

Comparing (4.8) with (4.6), we see that correlation between the rows 
reduces "effective sample size" from mtom: for a = 0.241 as in (3.7), the 
reduction is from m = 20,426 to m = 17.2! (Notice that row standardization 
effectively makes <7j = 1 in (4.7), so A = A (2.3), justifying the comparison.) 
The total correlation a shows up in other efficiency calculations; see Remark 
6.7. 




Proof of Theorem 2. The row-standardized matrix X = a 1 X has 
matrix normal distribution 

(4.10) X~A r m ,re(0, ^<8> A), 

where = <t _1 ^<t _1 has diagonal elements JE^ = 1. From (4.2) we see that 
^/ = ^^//(^jj^lj'j') 1 / 2 is the correlation between elements Xij and X^ij in 
the same column of X; A = X'X/m has entries Ajk = J^j^XijXik/m, and is 
unbiased for A, 

(4.11) E{A jk } = A jk , 
using (4.2). 

The covariance calculation for A involves expansion 



(4.12) A jk A lh = 
(4.13) 




Using the formula 

(4.14) E{Z 1 Z 2 Z 3 Z 4 } = 712734 + 713724 + 714723 

for a normal vector (Z1Z2Z3Z4)' with means and covariances 7^, (4.2) 
gives 

(4.15) E\Y^ XijXikXuxA = m[A jk A lh + A fl A kh + A jh A kl ] 



12 



B. EFRON 



and 

Ei XijXikXinXvh \ = m(m - l)A jk A lh 

(4-16) ^ 

+ (A j iA kh + A jh A kl )^2^ iit . 

Then (4.13) yields 

~~ / 1 + (m — Da' 

(4.17) E{A jk A lh } = A jk A lh + (A^A^ + A jh A kl ) 1 



rn 



giving 

(4.18) cov(A ifc , A lh ) = (A jt A kh + A jh A kl )/rh, 
as in (4.8). □ 

A corollary of Theorem 2, used in Section 5, concerns bilinear functions 
of A and A, 

(4.19) t 2 = w'Aw and f 2 = w'Aw, 
where w is a given n- vector. 




-0.5 0.0 0.5 

Column-wise correlations 



Fig. 4. Dashed curve is normal-theory null density for correlation coefficient from 
fh = 17.2 pairs of points; see Remark 6.6. Histogram is the 946 column correlations, right 
panel Figure 3. FDR test, q = 0.1, yielded 7 significant correlations, Cor > 0.723, including 
0.805 between arrays 31 and 32, Figure 1. 
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Corollary. Under model (4-1), t 2 has mean and variance 

(4.20) f 2 ~(T 2 ,2r 4 /m). 

The proof follows that for Theorem 2; see Remark 6.9. 

If ^ = / in (4.1), then A = A and f 2 has a scaled chi-squared distribution, 

(4.21) f 2 ~ T 2 -xl/m, 

with mean and variance f 2 ~ (r 2 ,2r /m), so again the effect of correlation 
within fi is to reduce the effective sample size from mtom (4.8). 
We can approximate A (4.7), with 

(4.22) A = X'a~ 2 X/m, 

where is an estimate of fin based on the observed variability in row i. 
If the rows of X have been standardized, then a\ = 1 and A returns to its 
original definition X'X/m. 

Both Theorem 2 and the Corollary encourage us to think of A as, ap- 
proximately, a scaled Wishart distribution based on an independent sample 
of size m, 

(4.23) A ~ Wishart (m,A)/m. 

The dangers of this approximation are discussed in Section 5, but it is, 
nevertheless, an evocative heuristic, as shown below. 

Figure 4 returns to the question of the seemingly overwhelming correlation 
0.805 between arrays 31 and 32 seen in Figure 1. A one-sided p-value was 
calculated for each of the 946 column correlations, using as a null hypothesis 
the normal theory correlation coefficient distribution based on a sample size 
of rh = 17.2 pairs of A/^O, /) points [the correct null if A = / in (4.23)]. 
Benjamini and Hochberg's (1995) False Discovery Rate test, level q = 0.1, 
was applied to the 946 p-values. This yielded 7 significant cases, those with 
sample correlation > 0.723; all 7 were from the block of arrays 27 to 32 
indicated in Figure 2. Correlation 0.805 does turn out to be significant, but 
by a much closer margin than Figure l's scattergram suggests. 

The FDR procedure was also applied using the simpler null distribution 
AA(-0.023,0.241 2 ) (3.7). This raised the significance threshold from 0.723 to 
0.780, removing two of the previously significant correlations. 

Theorem 1 showed that the variance of the observed column correlations 
is useless for testing column-wise independence, since any value at all can be 
induced by row correlations. The test in Figure 4 avoids this trap by looking 
for unusual outliers among the column correlations. It does not depend on 
the order of the columns, objection (2) in Section 2 for permutation tests, 
but pays the price of increased modeling assumptions. 
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5. Other test statistics. Theorem 2 offers a normal-theory strategy for 
testing column- wise independence. We begin with X ~ A/" mi n(0, fi<g> A) (4.1), 
taking 

(5.1) pa = 1 and Ajj = 1 for all i and j, 

as suggested by double standardization. The null hypothesis of column-wise 
independence is equivalent to the column correlation matrix A equaling the 
identity, 

(5.2) H :A = I, 

since then (4.2) says that all pairs in different columns are independent. 

To test (5.2), we estimate A with A, (4.22) or more simply A = X'X/m 
after standardization, and compute a test statistic 

(5.3) S = s(A), 

where s(-) is some measure of distance between A and /. The accuracy 
approximation A ~ (A, A^ /m) from (4.8), with A = /, is used to assess the 
significance level of the observed S, maybe even employing the more daring 
approximation A ~ Wishart(m, I)/m. Strategy (5.3) looks promising but, as 
the examples of this section will show, it suffers from serious difficulties that 
are absent under the classic assumption of independent rows. 

One of the difficulties stems from Theorem 1. An obvious test statistic 
for H : A = I is 

(5-4) s=^lr/{ n X 

3<j' \ / 

the average squared off-diagonal element of A. But A = Cov (3.1), so in the 
doubly standardized situation of (3.6), S is an increasing monotone function 
of a, the estimated total correlation. This disqualifies S as a test statistic for 
(5.2), since large values of a can always be attributed to row- wise correlation 
alone. 

Similarly, the variance of the eigenvalues (2.13), 

K 

(5.5) S = ]T(e fc -e.) 2 A 

fe=i 

looks appealing since the true eigenvalues all equal 1 when A = I. However, 
(5.5) is also a monotonic function of a; see Remark 6.1. 

The general difficulty here is "leakage," the fact that row-wise correlations 
affect the observed pattern of column- wise correlations. This becomes clearer 
by comparison with classical multivariate methods, where row-wise corre- 
lations are assumed away by taking 'p = I in (4.1). Johnson and Graybill 
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(1972) consider a two-way ANOVA problem where, after subtraction of main 
effects, X has the form 

(5.6) Xij = di(3j + Sij for i = 1, 2, . . . , m and j = 1, 2, . . . , n, 

aj ~ A/"(0, A) and ~ A/"(0, 1), all independently, with /3 = (/3i , /?2 , • • • , /3 n ) 
a fixed but unknown vector (representing "one degree of freedom for non- 
additivity" in the two-way table X, Johnson and Graybill's extension of 
Tukey's procedure). 

In the Kronecker notation (4.1), X ~ A/" min (0, / <g) A) with 

(5.7) A = / + A/3/3'. 

Now (5.2) becomes Hq : A = 0. Johnson and Graybill show that, with (5 un- 
known, the likelihood ratio test rejects Hq for large values of the eigenvalue 
ratio (2.13), 

K 

(5.8) S = ei/]Te fc . 

k=l 

Since the m rows of X are assumed independent, they can test Hq by com- 
parison of S with values S* = e*/J2k=i e *k obtained from 

(5.9) A* ~ Wishart(?n, I) /m, 
as in (4.5). 

Getting back to the correlated rows situation, Theorem 2 suggests com- 
paring S with values S* from 

(5.10) A* ~ Wishart(m, i") /rh, 

m as in (4.8). The solid histogram in Figure 5 compares 100 S* values from 

(5.10) , m = 17.2 for the Cardio data, with the observed value S = 0.207 
from the doubly standardized Cardio matrix for the healthy subjects used 
in Figure 3. All 100 S* values are much smaller than S, providing strong 
evidence against Hq : A = I. 

The evidence looks somewhat weaker, though, if we simulate S* values 
with A* obtained from random matrices 

(5.11) A*~A/" 20 ,426,44(0,£®/), 

doubly standardized, where has total correlation a = 0.241, the estimated 
value for X, (4.9). The line histogram in Figure 5 shows 100 such S* values, 
all still smaller than S, but substantially less so. (Remark 6.8 describes the 
construction of X* .) 

Why does (5.11) produce larger "null" S* values than (5.10)? The answer 
is simple: even though the first and second moments of A* = X*' X* /m 
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o - 1 



I 




0.12 



0.16 



0.18 



0.20 



Eigenratios S and S* 



Fig. 5. Eigenratio statistic (5.8) equals 0.207 for 20,426 x 44 Cardio matrix X; solid 
histogram 100 simulations S* from Wishart (5.10), fh = 17.2; line histogram 100 simula- 
tions from correlated-row X* matrices (5.11), a = 0.241, A = I. 

match A* from (5.10), its eigenvalues do not. The nonzero eigenvalues of 
X*'X* /m equal those offi= X*X*' jn. This is another example of leakage, 
where the fact that ^ in (5.11) is not the identity I m distorts the estimated 
eigenvalue of A*, even if A = I n . 

The eigenratio statistic S = ei/J2 e k is invariant under permutations of 
the columns of X, answering objection (2) to permutation testing of Section 
2. Because of invariance, the eigenratio and permutation tests provide inde- 
pendent p-values for testing the null hypothesis of i.i.d. columns, and so can 
be employed together. Figure 5 is disturbing nonetheless, in suggesting that 
an appropriate null distribution for S depends considerably on the choice of 
the nuisance parameter ^ in (5.11). 

The bilinear form (4.19)-(4.20) yields another class of test statistics, 



where w is a pre-chosen n-vector and r 2 = w'Aw. Delta-method arguments 



(5.12) 





yields the alternative form 



rn 



(5.14) 




i=l 
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In a two-sample situation like that for the Cardio study, sample sizes n\ 
and r&2, we can choose 

1 /2 

(5.15) w' = ( nim ) (-l ni /ni,l„ 2 /n 2 ), 

Vni + n 2 / 

"l n " indicating a vector of n l's. This choice makes 

( n i n 2 \ 1 ^ 2 ,- 

(5.16) Zi=[ ■ {X2i-x li ), 



\ni + n 2/ 

the multiple of the mean response difference between the two samples that 
has variance 1 if A = 7. In terms of (5.12), ||w|| 2 = 1 so r 2 = 1. 

For the Cardio study, with n\ = 44,?i2 = 19, and m = 17.2, we obtain 
f = 1.48, coefficient of variation 0.17. This puts f more than 2.8 standard er- 
rors above the null hypothesis value r = 1, again providing evidence against 
column-wise independence. The Zi values from (5.16) are nearly indistin- 
guishable from the Z{ values in Figure 1 — not surprisingly since with the 
rows of X standardized, Zi is an equivalent form of the two-sample t-statistic 
U in (1.1). 

Once again, however, there are difficulties with this as a test for column- 
wise independence. There is no question that the Z^s are overdispersed 
compared to the theoretical value r = 1. But problems other than column 
dependence can cause overdispersion, in particular unobserved covariate dif- 
ferences between subjects in the two samples [Efron (2004, 2008)]. 

The statistic S = w'Aw in (5.15) does not depend upon the order of the 
columns of X within each of the two samples, answering objection (2) against 
permutation tests, but it is the only such choice for a two-sample situation. 
Other u>'s might yield interesting results. The version of (5.15) comparing the 
first 22 healthy Cardio subjects with the second 22 provided the spectacular 
value f = 1.87, and here the "unobserved covariate" objection has less force. 

Now, however, the test statistic depends on the order of the columns 
within the healthy subjects' matrix, reviving objection (2). Again we might 
want to check a catalog of possible w vectors Wi, W2, ■ ■ ■ ,wh, leading back 
to test statistic 

(5.17) S B = Y, v h&u>h = tr(AB) (s = ^vi), 

h ^ h ' 

as in (2.10), the only difference being that the null distribution of A now 
involves normal theory rather than permutations. Remark 6.9 shows that 
the null first and second moments of Sb are similar to (5.12), 

(5.18) S- B ~ftr(B),-?-tr(W 



In summary, normal-theory methods are interesting and promising, but 
are not yet proven competitors for the permutation tests of Section 2. 
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6. Remarks. This section presents some brief remarks and details sup- 
plementing the previous material. 

Remark 6.1 (The constant C2). The variance constant C2 in Theorem 
1 (3.3) can be expressed as 

so that C2 > K(e/mn) 2 , with equality only if the eigenvalues e& are equal. 
In the doubly standardized case e = mn/K, giving 

(6.2) c 2 > 1/K, 

where K is the rank of X . 



(6.1) 



K 



C-2 



(mn)' 



K 

k=i 



Remark 6.2 (Permutation invariance). If the columns of X are i.i.d. 
observations from a distribution on M m , then the distribution of X is invari- 
ant under permutations: Xtt ~ X for any n x n permutation matrix ir. Now 
suppose X = L(X), where L performs the same operation on each column 
of X, for example, replacing each column by its normal scores vector. Then 

(6.3) Xtt = L(X)tv = L(Xtv) ~ L{X) = X, 

showing that X is permutation invariant. 

Similarly, suppose X = R(X), R performing the same operation X{ = 
r(Xi) on each row of X, where now we require r(x)7v = r(xn) for all n- 
vectors x. The same argument as (6.3) demonstrates that X is still permu- 
tation invariant. Iterating row and column standardizations as in Table 1 
then shows that if the original data matrix X is permutation invariant, so 
is its doubly standardized version. 

Remark 6.3 (Covariances after demeaning). Suppose that X is nor- 
mally distributed, with covariances fl ® A (4.2), all columns having the 
same expectation vector \i. Let X be the demeaned matrix obtained by 
subtracting all the row and column means of X. Then 

(6.4) J~AT m , n (0,^ A), 
where 

(6.5) A jj v = A ij .-A. j ,-A J -. + A.., 

dots indicating averaging over the missing subscripts, and similarly for £3. 
This shows that demeaning tends to reduce covariances by recentering them 
around 0. 
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Table 1 

Successive row and column standardizations of the 20,426 x 44 matrix of healthy Cardio 
subjects. "Col" empirical standard deviation of Corjji, j < j' ; "Eig" a from (3.6); 
"Row" from 1% sample of corn/ values, adjusted for overdispersion (6.6), sampling 

standard error 0.0034 





Col 


Row 


Eig 




Col 


Row 


Eig 


Demeaned 


0.252 


0.286 


0.000 


Demeaned 


0.252 


0.286 


0.000 


Col 


0.252 


0.249 


0.251 


Row 


0.241 


0.283 


0.279 


Row 


0.242 


0.255 


0.246 


Col 


0.241 


0.251 


0.240 


Col 


0.242 


0.241 


0.242 


Row 


0.240 


0.247 


0.241 


Row 


0.241 


0.246 


0.235 


Col 


0.240 


0.247 


0.240 


Col 


0.241 


0.244 


0.241 


Row 


0.241 


0.240 


0.235 


Row 


0.241 


0.245 


0.234 


Col 


0.241 


0.237 


0.240 


Col 


0.241 


0.238 


0.241 


Row 


0.241 


0.233 


0.233 



Remark 6.4 (Standardization). A matrix X is "column standardized" 
by individually subtracting the mean and dividing by the standard deviation 
of each column, and similarly for row standardization. Table 1 shows the 
effect of successive row and column standardizations on the 20,426 x 44 
demeaned matrix of healthy Cardio subjects. Here "Col" is the empirical 
standard deviation of the 946 column-wise correlations Corjji, j <j'; "Eig" 
is a in (3.6); and "Row" is the empirical standard deviation "/?" of a 1% 
sample of the row correlations corn', but adjusted for overdispersion, 

(6.6) 1^= " (>_ J_\ 

n — 1 \ n — lj 

Sampling error of the Row entries is about ±0.0034. 

The doubly standardized matrix X used for Figure 3 was obtained after 
five successive column-row standardizations. This was excessive; the figure 
looked almost the same after two iterations. Other microarray examples 
converged equally rapidly, though small counterexamples can be constructed 
where double standardization is not possible. 

Microarray analyzes usually begin with some form of column-wise stan- 
dardization [Bolstad et al. (2003), Qiu, Klebanov and Yakovlev (2005b)], 
designed to negate "brightness" differences between the n microarrays. In 
the same spirit, row standardization helps prevent incidental gene differences 
(e.g., very great or very small expression level variabilities) from obscuring 
the actual effects of interest. Standardization tends to reduce the apparent 
correlations as in Remark 6.3. Without standardization, the scatterplot in 
Figure 1 stretches out along the main diagonal, correlation 0.917, driven by 
genes with unusually large or small inherent expression levels. 

Remark 6.5 (Corrected estimates of the total correlation). Suppose 
that the true row correlations corn/ have mean and variance a 2 , as in 
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(3.8) with cor = 0, and that given corn', the usual estimate corn' has mean 
and variance 



(6.7) 



cofi,i' = [corn', (1 - cor iV ) /(n- 3)], 



(6.7) being a good normal-theory approximation [Johnson and Kotz 1970, 
Chapter 32]. Letting a 2 be the empirical variance of the cor a' values, a 
standard empirical Bayes derivation yields 



(61 



a' 



A 



-A 4 



A 



(n — 3)c^ 



1 



as an approximately unbiased estimate of a 2 . (If cor is not assumed to 
equal 0, a slightly more complicated formula applies.) Of course, a 2 = if 
the right-hand side of (6.8) is negative. 

Theorem 1 implies that a 2 nearly equals C2, (3.3), in the doubly stan- 
dardized situation. Formula (3.6), with, say, 



(6.9) 



1 



1 



a 



n 



1 



is not identical to (6.8), but provides an excellent approximation for values of 
a < 0.5: with n = 44 and a = 0.283 as in (3.6), a = 0.2415 while a = 0.2412. 



Remark 6.6 (Column and row centerings). The column correlation 
mean ft = — l/(n — 1) in (3.6) is forced by the row-wise demeaning ^ - Xij = 0, 
(2.1), centering the solid histogram in the right panel of Figure 3 at —0.023. 
With m = 20,426, the corresponding center for the line histogram is nearly 
0, and the difference in the two centerings is noticeable. The dashed density 
curve in Figure 4, and the corresponding p- values for the FDR analysis, were 
shifted 0.023 units leftward. 

Remark 6.7 (The total correlation a). The total correlation a, which 
plays a key role in Theorem 2, (4.9), also is the central parameter of the 
theory developed in Efron (2007a). Equations (3.15)-(3.16) there are equiv- 
alent to (5.12) here. In both papers, a has the very convenient feature of 
summarizing the effects of an enormous m x m correlation matrix ^ in a 
single number. 

Remark 6.8 for simulation (5.11)]. The X* simulation used in Figure 
5 began with m x n matrix Y = (yij), 

(6.10) = cij + eij, j ^ „jJ(q ^ (ah independent), 
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where I = 1,2,3,4,5 as i is in the first, second, . . . , last fifth of 1 through 
m; Y was then column standardized to give X* , so that ^ had a block form, 
with large positive correlations (about 0.61) in the (m/5) x (to/5) diagonal 
blocks. The choice 7 = 1.23 was required to yield a = 0.241. 

Remark 6.9 (Bilinear statistics). Since A~(A,A( 2 )/m) (4.8), it is 
clear that E{f 2 } = r 2 in Corollary (4.20). The variance calculation proceeds 
as in Theorem 2: 

var{f 2 } = J2 J2 A fk,lh w j w kWiw h /m 

jk Ih 

= J2J2i A jAkh + A jh Aki]wjW k wiw h /m 

jk Ih 

jl kh jh kl 

= 2(^2 ^-jkWjWij y/m = 2r 4 /m. 
^ jk ' 

The verification of (5.18) is the same, except with element bj k of B re- 
placing WjW k above, bih replacing wiw^, etc. 



(6.11) 



77! 
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